Overall Map
Chapter 1: Basic Concepts
Concepts: state, action, reward, return, episode, policy
Grid-world example
Markov decision process (MDP)
Chapter 2: Bellman Equation
One concept: state value
One tool: Bellman equation
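For reference, the Bellman equation named above is commonly written elementwise as follows. The notation is an assumption, since the outline does not fix symbols: policy \pi(a|s), model p(r|s,a) and p(s'|s,a), and discount rate \gamma.

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \left[ \sum_{r} p(r \mid s, a)\, r
         + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right],
\quad \text{for all } s.
```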
Chapter 3: Bellman Optimality Equation
A special Bellman equation
Two concepts: optimal policy & optimal state value
One tool: Bellman optimality equation
- Fixed-point theorem
- Fundamental problems
- An algorithm solving the equation
Chapter 4: Value Iteration & Policy Iteration
First algorithms for optimal policies
Three algorithms:
- Value iteration (VI)
- Policy iteration (PI)
- Truncated policy iteration
Need the environment model
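As a rough illustration of why these algorithms need the environment model, here is a minimal value iteration sketch. The array layout (`P[a, s, s2]` for transition probabilities, `R[s, a]` for expected immediate rewards) and the function name are assumptions made for this example, not conventions taken from the outline.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Sketch of value iteration on a small tabular MDP.

    P[a, s, s2]: probability of moving from s to s2 under action a (assumed layout).
    R[s, a]:     expected immediate reward for taking a in s (assumed layout).
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] * v[s2]
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)  # value update: greedy maximum over actions
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    # Greedy policy with respect to the final value estimate
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return v, q.argmax(axis=1)
```

Both `P` and `R` must be known in advance here, which is exactly the model requirement that the model-free methods of the later chapters remove.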
Chapter 5: Monte Carlo Learning
Mean estimation with sampling data
First model-free RL algorithms
- MC Basic
- MC Exploring Starts
- MC ε-greedy
Chapter 6: Stochastic Approximation
Gap: from non-incremental to incremental
Mean estimation
Algorithms:
- Robbins-Monro (RM) algorithm
- Stochastic gradient descent (SGD)
- SGD, BGD, MBGD
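The step from non-incremental to incremental mean estimation can be sketched as below. This assumes the step size 1/k, which makes the running estimate equal to the sample mean of the first k samples; the function name is hypothetical.

```python
def incremental_mean(samples):
    """Incremental mean estimate: w_{k+1} = w_k - (1/k) * (w_k - x_k)."""
    w = 0.0
    for k, x in enumerate(samples, start=1):
        # Correct the current estimate toward the new sample; no past
        # samples need to be stored.
        w = w - (1.0 / k) * (w - x)
    return w
```

Unlike averaging after all samples are collected, the estimate is usable after every sample, which is the property the TD methods of the next chapter rely on.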
Chapter 7: Temporal-Difference Learning
- TD learning of state values
- Sarsa: TD learning of action values
- Q-learning: TD learning of optimal action values
On-policy & off-policy
- Unified point of view
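One way to see the unified point of view is that tabular Sarsa and Q-learning share the same update and differ only in the TD target. The sketch below assumes a NumPy table `q[s, a]`, a step size `alpha`, and a discount rate `gamma`; all names are illustrative.

```python
import numpy as np

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: uses the action actually taken in the next state
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q

def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy target: uses the greedy (maximum) action value in the next state
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q
```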
Chapter 8: Value Function Approximation
Gap: from tabular representation to function representation
Algorithms:
- State value estimation with value function approximation (VFA)
- Sarsa/Q-learning with VFA
Deep Q-learning
Chapter 9: Policy Gradient Methods
Gap: from value-based to policy-based
- Metrics to define optimal policies
- Policy gradient
- Gradient-ascent algorithm (REINFORCE)
Chapter 10: Actor-Critic Methods
Gap: policy-based + value-based
Algorithms:
- The simplest actor-critic (QAC)
- Advantage actor-critic (A2C)
- Off-policy actor-critic
Importance sampling
- Deterministic actor-critic (DPG)